Survey on Fault Tolerance Techniques on Grid
Abstract
In a grid environment, thousands of resources, services and applications must interact to make the grid [1] usable as an execution platform. Because these elements are extremely heterogeneous, volatile and dynamic, failures can arise in many ways: not only from independent failures of each element, but also from the interactions between them. Given this inherent instability, fault detection and recovery are critical components that must be addressed. The need for fault tolerance is especially acute for large parallel applications, since the failure rate grows with the number of processors and the duration of the computation. In this paper we discuss the various fault management strategies that help to achieve fault tolerance, as a reference for researchers.
Keywords: Fault Tolerance, Grid Computing, Fault Prevention, Fault Avoidance, Fault Detection, Fault Recovery
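The claim that failures become more likely with more processors and longer runs can be illustrated with a minimal sketch. It assumes that processors fail independently at a constant rate (an exponential failure model), which the abstract does not state; the function name and the example rate are hypothetical.

```python
import math


def prob_any_failure(num_processors: int, hours: float, failures_per_hour: float) -> float:
    """Probability that at least one processor fails during the run.

    Assumed model (not from the paper): each processor fails independently
    at a constant rate, so P = 1 - exp(-N * rate * T).
    """
    return 1.0 - math.exp(-num_processors * failures_per_hour * hours)


if __name__ == "__main__":
    rate = 1e-5  # hypothetical per-processor failure rate, in failures per hour
    for n in (10, 100, 1000):
        # The probability rises with the processor count for a fixed 24-hour run.
        print(n, prob_any_failure(n, 24.0, rate))
```

Under this assumed model, doubling either the processor count or the run time has the same effect on the failure probability, which is why long-running, highly parallel jobs are the ones that most need checkpointing or replication.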
Similar resources
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid Computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing, which periodically ...
A Survey of Distributed Fault Tolerance Strategies
Grid computing is characterized by geographic distribution, heterogeneity (different hardware, software and networks), resource sharing, multiple administrators, dependable access, and pervasive access within dynamic organizations. In grid computing, the rate of failure is much greater than in traditional parallel computing. Therefore, fault tolerance is an important property in order to achie...
A Survey on Task Checkpointing and Replication based Fault Tolerance in Grid Computing
A grid is a distributed computational and storage environment, often composed of heterogeneous, autonomously managed subsystems. As a result, varying resource availability becomes commonplace, often resulting in loss and delay of executing jobs. To ensure good grid performance, fault tolerance should be taken into account. Commonly utilized techniques for providing fault tolerance in distributed...
Fault tolerance in computational grids: perspectives, challenges, and issues
Computational grids are established with the intention of providing shared access to hardware- and software-based resources, with special reference to increased computational capabilities. Fault tolerance is one of the most important issues faced by computational grids. The main contribution of this survey is the creation of an extended classification of problems that occur in the computation...
A Survey on Fault Tolerance Mechanisms for job scheduling in Grid computing
Grid computing is defined as a hardware and software infrastructure that enables sharing of coordinated resources in a dynamic environment. In grid computing, the probability of a failure is much greater than in parallel computing. Therefore, fault tolerance is an important issue in order to achieve reliability and availability of resources. When scheduling a job, the resource uses both average f...